{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "09_Preprocessing",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eTdCMVl9YAXw",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://practicalai.me\"><img src=\"https://raw.githubusercontent.com/practicalAI/images/master/images/rounded_logo.png\" width=\"100\" align=\"left\" hspace=\"20px\" vspace=\"20px\"></a>\n",
        "\n",
        "<img src=\"https://cdn4.iconfinder.com/data/icons/data-analysis-flat-big-data/512/data_cleaning-512.png\" width=\"90px\" vspace=\"10px\" align=\"right\">\n",
        "\n",
        "<div align=\"left\">\n",
        "<h1>Preprocessing</h1>\n",
        "In this lesson, we will explore preprocessing and data loading utilities in Tensorflow + Keras, mainly focused on text-based data.",
        "</div>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xuabAj4PYj57",
        "colab_type": "text"
      },
      "source": [
        "<table align=\"center\">\n",
        "  <td>\n",
        "<img src=\"https://raw.githubusercontent.com/practicalAI/images/master/images/rounded_logo.png\" width=\"25\"><a target=\"_blank\" href=\"https://practicalai.me\"> View on practicalAI</a>\n",
        "  </td>\n",
        "  <td>\n",
        "<img src=\"https://raw.githubusercontent.com/practicalAI/images/master/images/colab_logo.png\" width=\"25\"><a target=\"_blank\" href=\"https://colab.research.google.com/github/practicalAI/practicalAI/blob/master/notebooks/09_Preprocessing.ipynb\"> Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "<img src=\"https://raw.githubusercontent.com/practicalAI/images/master/images/github_logo.png\" width=\"22\"><a target=\"_blank\" href=\"https://github.com/practicalAI/practicalAI/blob/master/notebooks/basic_ml/09_Preprocessing.ipynb\"> View code on GitHub</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VoMq0eFRvugb",
        "colab_type": "text"
      },
      "source": [
        "# Overview"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JqxyljU18hvt",
        "colab_type": "text"
      },
      "source": [
        "* **Tokenizer**: data processing unit to convert text data to tokens\n",
        "* **LabelEncoder**: convert text labels to tokens"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rmvSOsB-CuJb",
        "colab_type": "text"
      },
      "source": [
        "# Set up"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "bpFe3zVoCuTD",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "e17fb5a6-6d62-46c5-fc40-ee79cde7f56d"
      },
      "source": [
        "# Use TensorFlow 2.x\n",
        "%tensorflow_version 2.x"
      ],
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "TensorFlow 2.x selected.\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "2PXdZbDzCuRE",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import os\n",
        "import numpy as np\n",
        "import tensorflow as tf"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "f-NnTT8LDjf3",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Arguments\n",
        "SEED = 1234\n",
        "DATA_FILE = 'news.csv'\n",
        "SHUFFLE = True\n",
        "INPUT_FEATURE = 'title'\n",
        "OUTPUT_FEATURE = 'category'\n",
        "LOWER = True\n",
        "CHAR_LEVEL = False\n",
        "TRAIN_SIZE = 0.7\n",
        "VAL_SIZE = 0.15\n",
        "TEST_SIZE = 0.15\n",
        "NUM_EPOCHS = 10\n",
        "BATCH_SIZE = 32"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "nB7mfVd7Dn9e",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Set seed for reproducibility\n",
        "np.random.seed(SEED)\n",
        "tf.random.set_seed(SEED)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XtKqNioAayCy",
        "colab_type": "text"
      },
      "source": [
        "# Load data"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "X3OrtMpFayFC",
        "colab_type": "text"
      },
      "source": [
        "We will download the [AG News dataset](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html), which consists of 120000 text samples from 4 unique classes ('Business', 'Sci/Tech', 'Sports', 'World')"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "9NfIz_4OPYpG",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import pandas as pd\n",
        "import re\n",
        "import urllib"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qPGNh-26V5gU",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Upload data from GitHub to notebook's local drive\n",
        "url = \"https://raw.githubusercontent.com/practicalAI/practicalAI/master/data/news.csv\"\n",
        "response = urllib.request.urlopen(url)\n",
        "html = response.read()\n",
        "with open(DATA_FILE, 'wb') as fp:\n",
        "    fp.write(html)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "q14QCqKxUS2v",
        "colab_type": "code",
        "outputId": "f8f9b1a3-607b-42a5-df6f-0825639fa2a1",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 204
        }
      },
      "source": [
        "# Load data\n",
        "df = pd.read_csv(DATA_FILE, header=0)\n",
        "X = df[INPUT_FEATURE].values\n",
        "y = df[OUTPUT_FEATURE].values\n",
        "df.head(5)"
      ],
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>title</th>\n",
              "      <th>category</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Wall St. Bears Claw Back Into the Black (Reuters)</td>\n",
              "      <td>Business</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Carlyle Looks Toward Commercial Aerospace (Reu...</td>\n",
              "      <td>Business</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Oil and Economy Cloud Stocks' Outlook (Reuters)</td>\n",
              "      <td>Business</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Iraq Halts Oil Exports from Main Southern Pipe...</td>\n",
              "      <td>Business</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Oil prices soar to all-time record, posing new...</td>\n",
              "      <td>Business</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                               title  category\n",
              "0  Wall St. Bears Claw Back Into the Black (Reuters)  Business\n",
              "1  Carlyle Looks Toward Commercial Aerospace (Reu...  Business\n",
              "2    Oil and Economy Cloud Stocks' Outlook (Reuters)  Business\n",
              "3  Iraq Halts Oil Exports from Main Southern Pipe...  Business\n",
              "4  Oil prices soar to all-time record, posing new...  Business"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 7
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WqQSfd6OjQR4",
        "colab_type": "text"
      },
      "source": [
        "# Preprocess data"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "UZPv7U5hjRnT",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def preprocess_text(text):\n",
        "    \"\"\"Common text preprocessing steps.\"\"\"\n",
        "    # Remove unwanted characters\n",
        "    text = re.sub(r\"[^0-9a-zA-Z?.!,¿]+\", \" \", text)\n",
        "\n",
        "    # Add space between words and punctuations\n",
        "    text = re.sub(r\"([?.!,¿])\", r\" \\1 \", text)\n",
        "    text = re.sub(r'[\" \"]+', \" \", text)\n",
        "\n",
        "    # Remove whitespaces\n",
        "    text = text.rstrip().strip()\n",
        "\n",
        "    return text"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "bRd0rgMojRqR",
        "colab_type": "code",
        "outputId": "a9036653-32ac-41e7-8722-103bfb3cc3fd",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 204
        }
      },
      "source": [
        "# Preprocess the titles\n",
        "df.title = df.title.apply(preprocess_text)\n",
        "df.head(5)"
      ],
      "execution_count": 9,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>title</th>\n",
              "      <th>category</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Wall St . Bears Claw Back Into the Black Reuters</td>\n",
              "      <td>Business</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Carlyle Looks Toward Commercial Aerospace Reuters</td>\n",
              "      <td>Business</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Oil and Economy Cloud Stocks Outlook Reuters</td>\n",
              "      <td>Business</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Iraq Halts Oil Exports from Main Southern Pipe...</td>\n",
              "      <td>Business</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Oil prices soar to all time record , posing ne...</td>\n",
              "      <td>Business</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                               title  category\n",
              "0   Wall St . Bears Claw Back Into the Black Reuters  Business\n",
              "1  Carlyle Looks Toward Commercial Aerospace Reuters  Business\n",
              "2       Oil and Economy Cloud Stocks Outlook Reuters  Business\n",
              "3  Iraq Halts Oil Exports from Main Southern Pipe...  Business\n",
              "4  Oil prices soar to all time record , posing ne...  Business"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 9
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KJ7adzmAEAFM",
        "colab_type": "text"
      },
      "source": [
        "<img width=\"45\" src=\"http://bestanimations.com/HomeOffice/Lights/Bulbs/animated-light-bulb-gif-29.gif\" align=\"left\" vspace=\"5px\" hspace=\"10px\">\n",
        "\n",
        "If you have preprocessing steps like standardization, etc. that are calculated, you need to separate the training and test set first before spplying those operations. This is because we cannot apply any knowledge gained from the test set accidentally during preprocessing/training. However for preprocessing steps like the function above where we aren't learning anything from the data itself, we can perform before splitting the data."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0kEIDV2rWkPA",
        "colab_type": "text"
      },
      "source": [
        "# Split data"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VNHCAonhWpQL",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import collections\n",
        "from sklearn.model_selection import train_test_split"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eU6CpWm7Ql9W",
        "colab_type": "text"
      },
      "source": [
        "### Components"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "NSffHAoeQmBM",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def train_val_test_split(X, y, val_size, test_size, shuffle):\n",
        "    \"\"\"Split data into train/val/test datasets.\"\"\"\n",
        "    X_train, X_test, y_train, y_test = train_test_split(\n",
        "        X, y, test_size=test_size, stratify=y, shuffle=shuffle)\n",
        "    X_train, X_val, y_train, y_val = train_test_split(\n",
        "        X_train, y_train, test_size=val_size, stratify=y_train, shuffle=shuffle)\n",
        "    return X_train, X_val, X_test, y_train, y_val, y_test"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "0Z0hVCj_QUHZ"
      },
      "source": [
        "### Operations"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "fEdM6iLjV_AY",
        "colab_type": "code",
        "outputId": "b0e7191a-9521-4030-d86b-f6d2029fbe53",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 119
        }
      },
      "source": [
        "# Create data splits\n",
        "X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(\n",
        "    X=X, y=y, val_size=VAL_SIZE, test_size=TEST_SIZE, shuffle=SHUFFLE)\n",
        "class_counts = dict(collections.Counter(y))\n",
        "print (f\"X_train: {X_train.shape}, y_train: {y_train.shape}\")\n",
        "print (f\"X_val: {X_val.shape}, y_val: {y_val.shape}\")\n",
        "print (f\"X_test: {X_test.shape}, y_test: {y_test.shape}\")\n",
        "print (f\"X_train[0]: {X_train[0]}\")\n",
        "print (f\"y_train[0]: {y_train[0]}\")\n",
        "print (f\"Classes: {class_counts}\")"
      ],
      "execution_count": 12,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "X_train: (86700,), y_train: (86700,)\n",
            "X_val: (15300,), y_val: (15300,)\n",
            "X_test: (18000,), y_test: (18000,)\n",
            "X_train[0]: PGA overhauls system for Ryder Cup points\n",
            "y_train[0]: Sports\n",
            "Classes: {'Business': 30000, 'Sci/Tech': 30000, 'Sports': 30000, 'World': 30000}\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yh_Vox49XtlU",
        "colab_type": "text"
      },
      "source": [
        "# Tokenizer"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "EBiqvYWJXvY1",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "from tensorflow.keras.preprocessing.text import Tokenizer"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AlI6O6QrYVLl",
        "colab_type": "text"
      },
      "source": [
        "### Operations"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "KsvjmXmIYVT5",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Input vectorizer\n",
        "X_tokenizer = Tokenizer(lower=LOWER,\n",
        "                        char_level=CHAR_LEVEL,\n",
        "                        oov_token='<UNK>')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aljdMgsdYlRl",
        "colab_type": "code",
        "outputId": "9b58ff52-1b54-42e2-ff84-4183839075ce",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "# Fit only on train data\n",
        "X_tokenizer.fit_on_texts(X_train)\n",
        "vocab_size = len(X_tokenizer.word_index) + 1\n",
        "print (f\"# tokens: {vocab_size}\")"
      ],
      "execution_count": 15,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "# tokens: 29782\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5old1UfyY-EQ",
        "colab_type": "code",
        "outputId": "d3876b6e-9169-41f8-b93e-1bee4036e558",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 68
        }
      },
      "source": [
        "# Convert text to sequence of tokens\n",
        "print (f\"X_train[0]: {X_train[0]}\")\n",
        "X_train = np.array(X_tokenizer.texts_to_sequences(X_train))\n",
        "X_val = np.array(X_tokenizer.texts_to_sequences(X_val))\n",
        "X_test = np.array(X_tokenizer.texts_to_sequences(X_test))\n",
        "print (f\"X_train[0]: {X_train[0]}\")\n",
        "print (f\"len(X_train[0]): {len(X_train[0])} characters\")"
      ],
      "execution_count": 16,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "X_train[0]: PGA overhauls system for Ryder Cup points\n",
            "X_train[0]: [2013, 7327, 467, 5, 702, 118, 1137]\n",
            "len(X_train[0]): 7 characters\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "B3rTUi2jbN7e",
        "colab_type": "text"
      },
      "source": [
        "<img width=\"45\" src=\"http://bestanimations.com/HomeOffice/Lights/Bulbs/animated-light-bulb-gif-29.gif\" align=\"left\" vspace=\"5px\" hspace=\"10px\">\n",
        "\n",
        "Checkout other preprocessing functions in the [official documentation](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/preprocessing)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "y8P2N42lZKW2",
        "colab_type": "text"
      },
      "source": [
        "# LabelEncoder"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5bYN5NP1ZK0k",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import json\n",
        "from sklearn.preprocessing import LabelEncoder"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zhrsmbbYZLS5",
        "colab_type": "text"
      },
      "source": [
        "### Operations"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "CrWsgug6ZGLs",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Output vectorizer\n",
        "y_tokenizer = LabelEncoder()"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Y7v4i94PZWgx",
        "colab_type": "code",
        "outputId": "482624cc-4009-46c1-c2f2-413ded987ab1",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "# Fit on train data\n",
        "y_tokenizer = y_tokenizer.fit(y_train)\n",
        "num_classes = len(y_tokenizer.classes_)\n",
        "print (f\"# classes: {num_classes}\")"
      ],
      "execution_count": 19,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "# classes: 4\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "N8me9sECZZUL",
        "colab_type": "code",
        "outputId": "450226d7-1511-4d38-dd91-8b9e98ed1cf5",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 51
        }
      },
      "source": [
        "# Convert labels to tokens\n",
        "print (f\"y_train[0]: {y_train[0]}\")\n",
        "y_train = y_tokenizer.transform(y_train)\n",
        "y_val = y_tokenizer.transform(y_val)\n",
        "y_test = y_tokenizer.transform(y_test)\n",
        "print (f\"y_train[0]: {y_train[0]}\")"
      ],
      "execution_count": 20,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "y_train[0]: Sports\n",
            "y_train[0]: 2\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "D3syfG98QSC1",
        "colab_type": "code",
        "outputId": "ca28df7b-5c84-4291-9711-b0c026d2978b",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 51
        }
      },
      "source": [
        "# Class weights\n",
        "counts = collections.Counter(y_train)\n",
        "class_weights = {_class: 1.0/count for _class, count in counts.items()}\n",
        "print (f\"class counts: {counts},\\nclass weights: {class_weights}\")"
      ],
      "execution_count": 21,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "class counts: Counter({2: 21675, 1: 21675, 3: 21675, 0: 21675}),\n",
            "class weights: {2: 4.61361014994233e-05, 1: 4.61361014994233e-05, 3: 4.61361014994233e-05, 0: 4.61361014994233e-05}\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vuB6zUH_br7b",
        "colab_type": "text"
      },
      "source": [
        "<img width=\"45\" src=\"http://bestanimations.com/HomeOffice/Lights/Bulbs/animated-light-bulb-gif-29.gif\" align=\"left\" vspace=\"5px\" hspace=\"10px\">\n",
        "\n",
        "Checkout the complete list of sklearn preprocessing functions in the [official documentation](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing)."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "96ZOlhGKp-ZG"
      },
      "source": [
        "# Padding"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "3Fu2AmcJp-LC"
      },
      "source": [
        "Our inputs are all of varying length but we need each batch to be uniformly shaped. Therefore, we will use padding to make all the inputs in the batch the same length. Our padding index will be 0 (note that X_tokenizer starts at index 1)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "fHAxXfkvp99r",
        "colab": {}
      },
      "source": [
        "from tensorflow.keras.preprocessing.sequence import pad_sequences"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "outputId": "e690241e-c3e8-41a3-d92f-76aa4ac1cdb3",
        "id": "Tz2zOt67p9wK",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "sample_X = np.array([[3, 89, 45]])\n",
        "max_seq_len = 10\n",
        "padded_sample_X = pad_sequences(sample_X, padding=\"post\", maxlen=max_seq_len)\n",
        "print (f\"{sample_X} → {padded_sample_X}\")"
      ],
      "execution_count": 23,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[[ 3 89 45]] → [[ 3 89 45  0  0  0  0  0  0  0]]\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nKne710fcSZR",
        "colab_type": "text"
      },
      "source": [
        "We will put all of these preprocessing utilities to use in the subsequent lessons."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Dh66tK82G2cL",
        "colab_type": "text"
      },
      "source": [
        "---\n",
        "<div align=\"center\">\n",
        "\n",
        "Subscribe to our <a href=\"https://practicalai.me/#newsletter\">newsletter</a> and follow us on social media to get the latest updates!\n",
        "\n",
        "<a class=\"ai-header-badge\" target=\"_blank\" href=\"https://github.com/practicalAI/practicalAI\">\n",
        "              <img src=\"https://img.shields.io/github/stars/practicalAI/practicalAI.svg?style=social&label=Star\"></a>&nbsp;\n",
        "            <a class=\"ai-header-badge\" target=\"_blank\" href=\"https://www.linkedin.com/company/madewithml\">\n",
        "              <img src=\"https://img.shields.io/badge/style--5eba00.svg?label=LinkedIn&logo=linkedin&style=social\"></a>&nbsp;\n",
        "            <a class=\"ai-header-badge\" target=\"_blank\" href=\"https://twitter.com/madewithml\">\n",
        "              <img src=\"https://img.shields.io/twitter/follow/madewithml.svg?label=Follow&style=social\">\n",
        "            </a>\n",
        "              </div>\n",
        "\n",
        "</div>"
      ]
    }
  ]
}